Vanilla differential expression on Packer C. elegans data using single-cell Variational Inference (scVI)

Processing and visualizing 89k cells from Packer et al. 2019 C. elegans 10xv2 single cell data

Original article: A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

https://science.sciencemag.org/content/365/6459/eaax1971.long

The anndata object we provide has 89701 cells and 20222 genes. It includes short gene descriptions from WormBase that will show up when mousing over the interactive plots.

Steps performed:

  1. Loading the data from anndata containing cell labels and gene descriptions
  2. Training the model with batch labels for integration with scVI
  3. Retrieving the scVI latent space and imputed values
  4. Visualize the latent space with t-SNE Plotly
  5. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly
In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from scvi.dataset import GeneExpressionDataset
from scvi.models import VAE
from scvi.inference import UnsupervisedTrainer
import torch
import anndata

import plotly.express as px
import plotly.graph_objects as go

from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger
[2020-02-28 17:54:43,621] INFO - scvi._settings | Added StreamHandler with custom formatter to 'scvi' logger.
/home/edaveiga/.local/lib/python3.7/site-packages/sklearn/utils/linear_assignment_.py:21: DeprecationWarning: The linear_assignment_ module is deprecated in 0.21 and will be removed from 0.23. Use scipy.optimize.linear_sum_assignment instead.
  DeprecationWarning)
In [2]:
# Change the path where the models will be saved 
save_path = "./"
vae_file_name = 'worm_vae.pkl'

if os.path.isfile('packer.h5ad'):
    print ("Found the data file! No need to download.")
else:
    print ("Downloading data...")
    ! wget https://github.com/Munfred/worm-notebooks/releases/download/Packer/packer.h5ad
Downloading data...
--2020-02-28 17:54:46--  https://github.com/Munfred/worm-notebooks/releases/download/Packer/packer.h5ad
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/222366268/6945e500-0f75-11ea-9b36-234e989d9d0f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200229%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200229T015447Z&X-Amz-Expires=300&X-Amz-Signature=8462821e7a7b5e4dd116186e4c5f60c53d2ee5d9ab97ffecd8cec10d29ca0b08&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dpacker.h5ad&response-content-type=application%2Foctet-stream [following]
--2020-02-28 17:54:47--  https://github-production-release-asset-2e65be.s3.amazonaws.com/222366268/6945e500-0f75-11ea-9b36-234e989d9d0f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200229%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200229T015447Z&X-Amz-Expires=300&X-Amz-Signature=8462821e7a7b5e4dd116186e4c5f60c53d2ee5d9ab97ffecd8cec10d29ca0b08&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dpacker.h5ad&response-content-type=application%2Foctet-stream
Resolving github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)... 52.216.76.92
Connecting to github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)|52.216.76.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 684895176 (653M) [application/octet-stream]
Saving to: ‘packer.h5ad’

100%[======================================>] 684,895,176 28.3MB/s   in 20s    

2020-02-28 17:55:07 (32.9 MB/s) - ‘packer.h5ad’ saved [684895176/684895176]

In [3]:
adata = anndata.read('packer.h5ad')
In [4]:
adata
Out[4]:
AnnData object with n_obs × n_vars = 89701 × 20222 
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    var: 'gene_id', 'gene_name', 'gene_description'
In [5]:
adata.X
Out[5]:
<89701x20222 sparse matrix of type '<class 'numpy.float32'>'
	with 82802059 stored elements in Compressed Sparse Column format>

Take a look at the gene descriptions

The gene descriptions were taken using the WormBase API.

In [6]:
display(adata.var.head().style.set_properties(subset=['gene_description'], **{'width': '600px'}))
gene_id gene_name gene_description
index
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16; daf-12; and hsf-1 based on RNA-seq and tiling array studies. Is affected by six chemicals including Rotenone; Psoralens; and metformin based on RNA-seq and microarray studies.
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies. Is affected by several genes including daf-16; daf-12; and clk-1 based on RNA-seq and microarray studies. Is affected by six chemicals including Alovudine; Psoralens; and metformin based on RNA-seq studies.
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1). Is predicted to contribute to NADH dehydrogenase activity. Human ortholog(s) of this gene are implicated in Leber hereditary optic neuropathy and MELAS syndrome.
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transporting ATP synthase activity, rotational mechanism.
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; clk-1; and elt-2 based on RNA-seq and microarray studies. Is affected by eight chemicals including stavudine; Zidovudine; and Psoralens based on RNA-seq and microarray studies.
In [7]:
adata.obs.head().T
Out[7]:
index AAACCTGAGACAATAC-300.1.1 AAACCTGAGGGCTCTC-300.1.1 AAACCTGAGTGCGTGA-300.1.1 AAACCTGAGTTGAGTA-300.1.1 AAACCTGCAAGACGTG-300.1.1
cell AAACCTGAGACAATAC-300.1.1 AAACCTGAGGGCTCTC-300.1.1 AAACCTGAGTGCGTGA-300.1.1 AAACCTGAGTTGAGTA-300.1.1 AAACCTGCAAGACGTG-300.1.1
numi 1630 2319 3719 4251 1003
time_point 300_minutes 300_minutes 300_minutes 300_minutes 300_minutes
batch Waterston_300_minutes Waterston_300_minutes Waterston_300_minutes Waterston_300_minutes Waterston_300_minutes
size_factor 1.02319 1.45821 2.33828 2.65905 0.62961
cell_type Body_wall_muscle nan nan Body_wall_muscle Ciliated_amphid_neuron
cell_subtype BWM_head_row_1 nan nan BWM_anterior AFD
plot_cell_type BWM_head_row_1 nan nan BWM_anterior AFD
raw_embryo_time 360 260 270 260 350
embryo_time 380 220 230 280 350
embryo_time_bin 330-390 210-270 210-270 270-330 330-390
raw_embryo_time_bin 330-390 210-270 270-330 210-270 330-390
lineage MSxpappp MSxapaap nan Dxap ABalpppapav/ABpraaaapav
passed_qc True True True True True

Loading data

We load the Packer data and use the batch annotations for scVI. Here, each experiment correspond to a batch.

In [8]:
gene_dataset = GeneExpressionDataset()

# we provide the `batch_indices` so that scvi can perform batch correction
gene_dataset.populate_from_data(
            adata.X,
            gene_names=adata.var.index.values,
            cell_types=adata.obs['cell_type'].values,
            batch_indices=adata.obs['batch'].cat.codes.values,
            )

# We select the 1000 most variable genes, which is the default selection criteria of scvi
gene_dataset.subsample_genes(1000)
sel_genes = gene_dataset.gene_names
[2020-02-28 17:55:09,567] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-02-28 17:55:09,570] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-02-28 17:55:11,069] INFO - scvi.dataset.dataset | Downsampling from 20222 to 1000 genes
[2020-02-28 17:55:11,131] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-02-28 17:55:11,448] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2020-02-28 17:55:11,574] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-02-28 17:55:11,879] INFO - scvi.dataset.dataset | Downsampled from 89701 to 89701 cells
In [9]:
adata.obs['cell_type'].values
Out[9]:
[Body_wall_muscle, nan, nan, Body_wall_muscle, Ciliated_amphid_neuron, ..., Rectal_gland, nan, nan, nan, nan]
Length: 89701
Categories (37, object): [ABarpaaa_lineage, Arcade_cell, Body_wall_muscle, Ciliated_amphid_neuron, ..., hmc_and_homolog, hmc_homolog, hyp1V_and_ant_arc_V, nan]

Training

  • n_epochs: Maximum number of epochs to train the model.
    If the likelihood change is small than a set threshold training will stop automatically.
  • lr: learning rate. Set to 0.001 here.
  • use_cuda: Set to true to use CUDA (GPU required)
In [10]:
# for this dataset 5 epochs is sufficient 
n_epochs = 5
lr = 1e-3
use_cuda = False

We now create the model and the trainer object. We train the model and output model likelihood every epoch. In order to evaluate the likelihood on a test set, we split the datasets (the current code can also so train/validation/test).

If a pre-trained model already exist in the save_path then load the same model rather than re-training it. This is particularly useful for large datasets.

In [11]:
# set the VAE to perform batch correction
vae = VAE(gene_dataset.nb_genes, n_batch=gene_dataset.n_batches)
In [12]:
trainer = UnsupervisedTrainer(
    vae,
    gene_dataset,
    train_size=0.75, # number between 0 and 1, default 0.8
    use_cuda=use_cuda,
    frequency=1,
)
In [13]:
# check if a previously trained model already exists, if yes load it

full_file_save_path = os.path.join(save_path, vae_file_name)

if os.path.isfile(full_file_save_path):
    trainer.model.load_state_dict(torch.load(full_file_save_path))
    trainer.model.eval()
else:
    trainer.train(n_epochs=n_epochs, lr=lr)
    torch.save(trainer.model.state_dict(), full_file_save_path)
training: 100%|██████████| 5/5 [08:11<00:00, 98.35s/it]

Plotting the likelihood change across training

In [14]:
train_test_results = pd.DataFrame(trainer.history).rename(columns={'elbo_train_set':'Train', 'elbo_test_set':'Test'})

train_test_results
Out[14]:
Train Test
0 8552.526833 8709.827793
1 814.322880 819.643316
2 772.865531 778.551824
3 755.723419 761.244525
4 741.715544 747.397204
5 734.052246 739.964843
In [15]:
ax = train_test_results.plot()
ax.set_xlabel("Epoch")
ax.set_ylabel("Error")
plt.show()

Obtaining the posterior object and sample latent space

The posterior object contains a model and a gene_dataset, as well as additional arguments that for Pytorch's DataLoader. It also comes with many methods or utilities querying the model, such as differential expression, imputation and differential analyisis.

To get an ordered output result, we might use .sequential posterior's method which return another instance of posterior (with shallow copy of all its object references), but where the iteration is in the same ordered as its indices attribute.

In [16]:
full = trainer.create_posterior(trainer.model, gene_dataset, indices=np.arange(len(gene_dataset)))
latent, batch_indices, labels = full.sequential().get_latent()
batch_indices = batch_indices.ravel()
In [17]:
# scvi tutorial latent space code
full = trainer.create_posterior(trainer.model, gene_dataset, indices=np.arange(len(gene_dataset)))
latent, batch_indices, labels = full.sequential().get_latent()
batch_indices = batch_indices.ravel()
latent.shape
Out[17]:
(89701, 10)
In [18]:
# store the latent space in a new anndata object
post_adata = anndata.AnnData(X=gene_dataset.X)
post_adata.obs=adata.obs
post_adata.obsm["latent"] = latent
post_adata
Out[18]:
AnnData object with n_obs × n_vars = 89701 × 1000 
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    obsm: 'latent'

Plotting t-SNE

Using Plotly's Scattergl we can easily and speedily make interactive plots with 89k cells!

In [19]:
# here's a hack to randomize categorical colors, since plotly can't do that in a straightforward manner
# we take the list of named css colors that it recognizes, and we picked a color based on the code of
# the cluster we are coloring
css_colors=[
'aliceblue','antiquewhite','aqua','aquamarine','azure','bisque','black','blanchedalmond','blue',
'blueviolet','brown','burlywood','cadetblue','chartreuse','chocolate','coral','cornflowerblue',
'crimson','cyan','darkblue','darkcyan','darkgoldenrod','darkgray','darkgrey','darkgreen','darkkhaki',
'darkmagenta','darkolivegreen','darkorange','darkorchid','darkred','darksalmon','darkseagreen',
'darkslateblue','darkslategray','darkslategrey','darkturquoise','darkviolet','deeppink','deepskyblue',
'dimgray','dimgrey','dodgerblue','firebrick','floralwhite','forestgreen','fuchsia','gainsboro','ghostwhite',
'gold','goldenrod','gray','grey','green','greenyellow','honeydew','hotpink','indianred','indigo',
'ivory','khaki','lavender','lavenderblush','lawngreen','lemonchiffon','lightblue','lightcoral','lightcyan',
'lightgoldenrodyellow','lightgray','lightgrey','lightgreen','lightpink','lightsalmon','lightseagreen',
'lightskyblue','lightslategray','lightslategrey','lightsteelblue','lightyellow','lime','limegreen','linen',
'magenta','maroon','mediumaquamarine','mediumblue','mediumorchid','mediumpurple','mediumseagreen',
'mediumslateblue','mediumspringgreen','mediumturquoise','mediumvioletred','midnightblue','mintcream',
'mistyrose','moccasin','navajowhite','navy','oldlace','olive','olivedrab','orange','orangered','orchid',
'palegoldenrod','palegreen','paleturquoise','palevioletred','papayawhip','peachpuff','peru','pink','plum'
,'powderblue','purple','red','rosybrown','royalblue','saddlebrown','salmon','sandybrown','seagreen',
'seashell','sienna','silver','skyblue','slateblue','slategray','slategrey','snow','springgreen','steelblue',
'tan','teal','thistle','tomato','turquoise','violet','wheat','white','whitesmoke','yellow','yellowgreen']

# we just repeat the list of colors a bunch of times to ensure we always have more colors than clusters
css_colors = css_colors*100

# now define a function to plot any embedding

def plot_embedding(embedding_kind,              # the embedding must be a label in the post_adata.obsm
                   adata=adata,                 # the original adata for taking the cluster labels
                   post_adata=post_adata,
                   cluster_feature ='cell_type', 
                   xlabel="Dimension 1", 
                   ylabel="Dimension 2", 
                   plot_title="Embedding on single cell data"):

    # `cluster_feature` should be the name of one of the categorical annotation columns
    # e.g. `cell_type`, `cell_subtype`, `time_point` 

    cluster_ids = adata.obs[cluster_feature].cat.codes.unique()
    id_to_cluster_map = dict( zip( adata.obs[cluster_feature].cat.codes, adata.obs[cluster_feature] ) )
    cluster_to_id_map  = dict([[v,k] for k,v in id_to_cluster_map.items()])

    fig = go.Figure()
    for _cluster_id in adata.obs[cluster_feature].cat.codes.unique():

        fig.add_trace(    
            go.Scattergl(
                  x = post_adata.obsm[embedding_kind][:,0][post_adata.obs[cluster_feature].cat.codes==_cluster_id]
                , y = post_adata.obsm[embedding_kind][:,1][post_adata.obs[cluster_feature].cat.codes==_cluster_id]
                , mode='markers'
                , marker=dict(
                    # we randomize colors by starting at a random place in the list of named css colors 
                    color=css_colors[_cluster_id+np.random.randint(0,len(np.unique(css_colors)))]
                    , size = 3
                        )
                , showlegend=True
                , name=id_to_cluster_map[_cluster_id]
                , hoverinfo=['name']
                )
            )

    layout={
        "title": {"text": plot_title
                  , 'x':0.5        
                 }
        , 'xaxis': {'title': {"text": xlabel}}
        , 'yaxis': {'title': {"text": ylabel}}
        , "height": 1000
        , "width":1400
    }
    fig.update_layout(layout)
    fig.update_layout(showlegend=True)
    return fig

Plot interactive t-SNE of cell types

In [20]:
tsne = TSNE(callbacks=ErrorLogger(),
            initialization='random',
            negative_gradient_method='fft',
            callbacks_every_iters=200,
            n_iter=1400,
            neighbors='approx')

tsne_embedding = tsne.fit(latent)
post_adata.obsm["tsne"] = np.array(tsne_embedding)
print('t-SNE embedding done!')
Iteration  200, KL divergence  6.7903, 200 iterations in 37.0374 sec
Iteration  200, KL divergence  4.4715, 200 iterations in 37.0448 sec
Iteration  400, KL divergence  3.9704, 200 iterations in 36.8763 sec
Iteration  600, KL divergence  3.6815, 200 iterations in 39.5945 sec
Iteration  800, KL divergence  3.4848, 200 iterations in 46.3863 sec
Iteration  1000, KL divergence  3.3388, 200 iterations in 52.4479 sec
Iteration  1200, KL divergence  3.2249, 200 iterations in 59.4345 sec
Iteration  1400, KL divergence  3.1329, 200 iterations in 75.2592 sec
t-SNE embedding done!
In [21]:
fig = plot_embedding(embedding_kind='tsne', 
               cluster_feature ='cell_type', 
               xlabel="t-SNE 1", 
               ylabel="t-SNE 2", 
               plot_title="t-SNE on scVI latent space for Packer C. elegans single cell data")
In [22]:
# uncomment this line to save an interactive html plot, in case it is not rendering
# fig.write_html('worms_interactive_tsne.html')
fig.show()

Performing Differential Expression

Note: scVI recently introduced a second way to perform DE, and some functions and documentation are still changing. The old mode is called vanilla and is performed here. The new mode is called change and is not done in this notebook.

  • From the trained VAE model we can sample the gene expression rate for each gene in each cell.
  • For two populations of interest, we can then randomly sample pairs of cells, one from each population to compare their expression rate for a gene.
  • In the vanilla mode, DE is measured by logit(p/(1-p)) where p is the probability of a cell from population A having a higher expression than a cell from population B.
  • We can form the null distribution of the DE values by sampling pairs randomly from the combined population.

Explanation of Vanilla DE mode (performed in this notebook)

Explanation adapted from the scVi differential_expression_score docstring.

The vanilla mode follows protocol described in Lopez et al, arXiv:1709.02082.

In this case for a given gene we perform hypothesis testing based on latent variable in the generative model that models the mean of the gene expression.

We are comparing $h_{1g}$, the mean expression of gene $g$ in cell type 1, with $h_{2g}$, the mean expression of $g$ in cell type 2.

The hypotheses are:

$$ M^g_1: h_{1g} > h_{2g} $$$$ M^g_2: h_{1g} \leq h_{2g} $$

DE between cell types 1 and 2 for each gene can then be based on the Bayes factors:

$$ \text{Natural Log Bayes Factor for gene g in cell types 1 and 2} = \ln ( {BF^g_{12}) = \ln(\frac{ p(M^g_1 | x_1, x_2)}{p(M^g_2 | x_1, x_2)}}) $$

Note that the scvi differential_expression_score returns the natural logarithm of the Bayes Factor.
This is $\ln(BF_{10})$ in the table discussed below.

To compute the gene specific Bayes factors using masks idx1 and idx2 we sample the Posterior in the following way:

  1. The posterior is sampled n_samples times for each subpopulation

  2. For computation efficiency (posterior sampling is quite expensive), instead of comparing element-wise the obtained samples, we can permute posterior samples.

Remember that computing the Bayes Factor requires sampling $q(z_A | x_A)$ and $q(z_B | x_B)$

Interpreting Bayes factors

To learn more about Bayes factors vs. p-values, see the review On p-Values and Bayes Factors by Leonhard Held and Manuela Ott.

For a shorter overview, see this blog post. A common interpretation table is copied below.
In our notation, $BF_{10}$ is $BF^g_{12}$, $H_0$ is $M^g_1$ and $H_1$ is $M^g_2$

Bayes factor $BF_{10}$ $\ln(BF_{10})$ Interpretation
> 100 > 4.60 Extreme evidence for H1
30 – 100 (3.4, 4.6) Very strong evidence for H1
10 – 30 (2.3, 3.4) Strong evidence for H1
3 – 10 (1.1, 2.3) Moderate evidence for H1
1 – 3 (0 , 1.1) Anecdotal evidence for H1
1 0 No evidence
1/3 – 1 (-1.1, 0) Anecdotal evidence for H0
1/3 – 1/10 (-2.30, -1.1) Moderate evidence for H0
1/10 – 1/30 (-3.4, -2.30) Strong evidence for H0
1/30 – 1/100 (-4.6, -3.4) Very strong evidence for H0
< 1/100 < -4.6 Extreme evidence for H0

Selecting cells to compare

In [23]:
# let's take a look at abundances of different cell types
adata.obs['cell_type'].value_counts()
Out[23]:
nan                               35052
Body_wall_muscle                  17520
Hypodermis                         7746
Ciliated_amphid_neuron             6090
Ciliated_non_amphid_neuron         4468
Seam_cell                          2766
Pharyngeal_muscle                  2562
Glia                               1857
Intestine                          1732
Pharyngeal_neuron                  1471
Pharyngeal_marginal_cell            911
Coelomocyte                         787
Pharyngeal_gland                    786
GLR                                 768
Intestinal_and_rectal_muscle        568
Germline                            499
Pharyngeal_intestinal_valve         493
Arcade_cell                         434
Z1_Z4                               372
Rectal_cell                         327
M_cell                              315
ABarpaaa_lineage                    273
Rectal_gland                        265
Excretory_cell                      215
Excretory_gland                     205
hmc                                 189
hmc_homolog                         155
T                                   141
hmc_and_homolog                     122
Parent_of_exc_gland_AVK             114
hyp1V_and_ant_arc_V                 112
Excretory_duct_and_pore              91
Parent_of_hyp1V_and_ant_arc_V        75
G2_and_W_blasts                      72
Excretory_cell_parent                62
Parent_of_exc_duct_pore_DB_1_3       61
XXX                                  25
Name: cell_type, dtype: int64
In [24]:
# let's pick two cell types
cell_type_1 = 'Ciliated_non_amphid_neuron'
cell_type_2 = 'Intestine'


cell_idx1 = adata.obs['cell_type'] == cell_type_1
print(sum(cell_idx1), 'cells of type', cell_type_1)
cell_idx2 = adata.obs['cell_type'] == cell_type_2
print(sum(cell_idx2), 'cells of type', cell_type_2)
4468 cells of type Ciliated_non_amphid_neuron
1732 cells of type Intestine

Vanilla DE parameters

  • n_samples: the number of times to sample the posterior gene frequencies from the vae model for each gene in each cell.
  • M_permutation: Number of pairs sampled for comparison.
  • idx1: boolean array masking subpopulation cells 1. (True where cell is from population)
  • idx2: boolean array masking subpopulation cells 2. (True where cell is from population)
In [25]:
n_samples = 10000
M_permutation = 10000
In [26]:
de_res = full.differential_expression_score(
    cell_idx1, 
    cell_idx2, 
    n_samples=n_samples, 
    M_permutation=M_permutation,
)
  • bayesi: The bayes factor for cell type 1 having a higher expression than cell type 2
  • bayesi_permuted: estimate Bayes Factors of random populations of the union of the two cell populations
  • meani: average UMI counts in cell type i
  • nonzi: proportion of non-zero expression in cell type i
  • norm_meani: average UMI counts in cell type i normalized by cell size
  • scalei: average scVI imputed gene expression scale in cell type i
In [27]:
de_res.head()
Out[27]:
bayes1 bayes1_permuted bayes2 bayes2_permuted mean1 mean2 nonz1 nonz2 norm_mean1 norm_mean2 scale1 scale2
WBGENE00185075 18.420681 -0.022401 -7.600382 -0.001200 0.0 0.0 0.0 0.0 3.855519 0.044794 0.006409 0.000048
WBGENE00018352 9.210140 -0.026402 -7.823621 -0.007600 0.0 0.0 0.0 0.0 0.358880 0.007970 0.002788 0.000022
WBGENE00003175 8.516943 -0.012000 -7.417964 -0.021601 0.0 0.0 0.0 0.0 10.781455 0.025461 0.004882 0.000022
WBGENE00015062 8.111395 -0.007600 -7.263716 -0.011600 0.0 0.0 0.0 0.0 0.094993 0.006145 0.003122 0.000024
WBGENE00010898 8.111395 -0.016000 -7.600382 -0.025601 0.0 0.0 0.0 0.0 3.656473 0.007220 0.001741 0.000011
In [28]:
# manipulate the DE results for plotting
de = de_res.copy()

# we compute the ratio of the scVI scales to use that as a rough proxy for fold change
de['ratio_scale12']=de['scale1']/de['scale2']
de['log_scale_ratio']=np.log2(de['ratio_scale12'])

# we take absolute values of the first bayes factor as the one to use on the volcano plot
# bayes1 and bayes2 should be roughtly the same, except with opposite signs
de['abs_bayes_factor']=np.abs(de['bayes1'])

#make a copy of the annotated gene metadata with gene ids all lower case to avoid problems when joining dataframes
adata_var_uppercase = adata.var.copy()
adata_var_uppercase.index = adata_var_uppercase.index.str.upper()

#convert top_expression gene ids index to lowercase for joining with metadata
de=de.join(adata_var_uppercase, how='inner')
In [29]:
de.head().T
Out[29]:
WBGENE00185075 WBGENE00018352 WBGENE00003175 WBGENE00015062 WBGENE00010898
bayes1 18.4207 9.21014 8.51694 8.11139 8.11139
bayes1_permuted -0.0224009 -0.0264015 -0.0120001 -0.00760004 -0.0160003
bayes2 -7.60038 -7.82362 -7.41796 -7.26372 -7.60038
bayes2_permuted -0.0012 -0.00760004 -0.0216008 -0.0116001 -0.0256014
mean1 0 0 0 0 0
mean2 0 0 0 0 0
nonz1 0 0 0 0 0
nonz2 0 0 0 0 0
norm_mean1 3.85552 0.35888 10.7815 0.0949926 3.65647
norm_mean2 0.044794 0.00797038 0.0254612 0.00614483 0.00722001
scale1 0.00640919 0.00278759 0.00488169 0.00312245 0.00174146
scale2 4.75148e-05 2.20513e-05 2.17731e-05 2.37333e-05 1.08099e-05
ratio_scale12 134.888 126.414 224.208 131.564 161.098
log_scale_ratio 7.07562 6.98201 7.80869 7.03962 7.3318
abs_bayes_factor 18.4207 9.21014 8.51694 8.11139 8.11139
gene_id WBGene00185075 WBGene00018352 WBGene00003175 WBGene00015062 WBGene00010898
gene_name C03C10.9 fbxc-20 mec-12 trx-1 nphp-1
gene_description None Is enriched in male and pharynx based on micro... Is an ortholog of human TUBA1A (tubulin alpha ... Is an ortholog of human TXN (thioredoxin). Exh... Is an ortholog of human NPHP1 (nephrocystin 1)...

Volcano plot of DE between two clusters with gene descriptions

Because we're using the vanilla mode, note that this volcano plot shows the ratios of the scVI expression scale of the two tissues vs the absolute value of the natural log bayes factor.

The new change mode allows for calculating log fold change and p-values, which are more commonly seen in volcano plots

We can highlight genes of interest based on simple string matching. For example, the cell below highlights all C. elegans neuropeptides (whose name conveniently all start with nlp, ins or flp). Other genes with be a transparent gray dot.

In [30]:
de['gene_color'] = 'rgba(100, 100, 100, 0.25)'
de['gene_color'][de['gene_name'].str.contains('ins-')] = 'rgba(255, 0,0, 1)'
de['gene_color'][de['gene_name'].str.contains('flp-')] = 'rgba(0, 255,0, 1)'
de['gene_color'][de['gene_name'].str.contains('nlp-')] = 'rgba(0, 0,255, 1)'
/home/edaveiga/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/home/edaveiga/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/home/edaveiga/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [31]:
# first we create these variables to customize the hover text in plotly's heatmap
# the text needs to be arranged in a matrix the same shape as the heatmap
# for the gene descriptions text, which can be several sentences, we add a line break after each sentence
de['gene_description_html'] = de['gene_description'].str.replace('\. ', '.<br>')

fig = go.Figure(
                data=go.Scatter(
                          x=de["log_scale_ratio"].round(3)
                        , y=de["abs_bayes_factor"].round(3)
                        , mode='markers'
                        , marker=dict(color=de['gene_color'])
                        , hoverinfo='text'
                        , text=de['gene_description_html']
                        , customdata=de.gene_id.values + '<br>Name: ' + de.gene_name.values
                        , hovertemplate='%{customdata} <br>' +
                                        '|ln(BF)|: %{y}<br>' +
                                        'Log2 scale ratio: %{x}' +
                                        '<extra>%{text}</extra>'
                        )
                        , layout= {
                                "title": {"text": 
                                          "Differential expression on of Packer C. elegans single cell data between cell types <br> <b>" +
                                          str(cell_type_1) + "</b> and <b>" + str(cell_type_2) + " "
                                          , 'x':0.5        
                                         }
                                , 'xaxis': {'title': {"text": "Log2 of scVI expression scale"}}
                                , 'yaxis': {'title': {"text": "Absolute value of natural log of Bayes Factor"}}
                        }
               )
# uncomment line below to save the interactive volcano plot as html
# fig.write_html('worms_interactive_volcano_plot.html')
fig.show()

Heatmap of top expressed genes with gene descriptions

Now we perform DE between each cell type vs all other cells and make a heatmap of the result.

First we need to make cell type sumary with numerical codes for each cell type

In [32]:
# we need to numerically encode the cell types for passing the cluster identity to scVI
cell_code_to_type = dict( zip( adata.obs['cell_type'].cat.codes, adata.obs['cell_type'] ) )
cell_type_to_code_map = dict([[v,k] for k,v in cell_code_to_type.items()])
# check that we got unique cell type labels
assert len(cell_code_to_type)==len(cell_type_to_code_map)

cell_types_summary=pd.DataFrame(index=adata.obs['cell_type'].value_counts().index)
cell_types_summary['cell_type_code']=cell_types_summary.index.map(cell_type_to_code_map)
cell_types_summary['ncells']=adata.obs['cell_type'].value_counts()
cell_types_summary['cell_type_name']=adata.obs['cell_type'].value_counts().index
cell_types_summary.to_csv('packer_cell_types_summary.csv')
cell_types_summary.head()
Out[32]:
cell_type_code ncells cell_type_name
nan 36 35052 nan
Body_wall_muscle 2 17520 Body_wall_muscle
Hypodermis 14 7746 Hypodermis
Ciliated_amphid_neuron 3 6090 Ciliated_amphid_neuron
Ciliated_non_amphid_neuron 4 4468 Ciliated_non_amphid_neuron
In [33]:
# create a column in the cell data with the cluster id each cell belongs to
adata.obs['cell_type_code'] = adata.obs['cell_type'].cat.codes

# this returns a list of dataframes with DE results (one for each cluster), and a list with the corresponding cluster id
per_cluster_de, cluster_id = full.one_vs_all_degenes(cell_labels=adata.obs['cell_type_code'].ravel(), min_cells=1)
In [34]:
# pick the top 10 genes in each cluster
top_genes = []
for x in per_cluster_de:
    top_genes.append(x[:10])
top_genes = pd.concat(top_genes)
top_genes = np.unique(top_genes.index)
In [35]:
# fetch the expression values for the top 10 genes
top_expression = [x.filter(items=top_genes, axis=0)['norm_mean1'] for x in per_cluster_de]
top_expression = pd.concat(top_expression, axis=1)
top_expression = np.log10(1 + top_expression)
cluster_name = [cell_code_to_type[_id] for _id in cluster_id]
top_expression.columns=cluster_name

# convert into anndata object to tie with more metadata, such as gene names and descriptions
top_expression = anndata.AnnData(top_expression.T)
top_expression.obs = top_expression.obs.join(cell_types_summary)
top_expression.obs.head()
Out[35]:
cell_type_code ncells cell_type_name
ABarpaaa_lineage 0 273 ABarpaaa_lineage
Arcade_cell 1 434 Arcade_cell
Body_wall_muscle 2 17520 Body_wall_muscle
Ciliated_amphid_neuron 3 6090 Ciliated_amphid_neuron
Ciliated_non_amphid_neuron 4 4468 Ciliated_non_amphid_neuron
In [36]:
#make a copy of the annotated gene metadata with gene ids all lower case to avoid problems when joining dataframes
adata_var_lowcase = adata.var.copy()
adata_var_lowcase.index = adata_var_lowcase.index.str.lower()

#convert top_expression gene ids index to lowercase for joining with metadata
top_expression.var.index = top_expression.var.index.str.lower()
top_expression.var=top_expression.var.join(adata_var_lowcase)

top_expression.var.index=top_expression.var['gene_id']
top_expression.var.head().style.set_properties(subset=['gene_description'], **{'width': '600px'})
Out[36]:
gene_id gene_name gene_description
gene_id
WBGene00000064 WBGene00000064 act-2 Is an ortholog of human ACTB (actin beta). Is predicted to have ATP binding activity. Is involved in several processes, including cortical actin cytoskeleton organization; cytoskeleton-dependent cytokinesis; and embryo development. Localizes to the actin filament and cell cortex. Is expressed in body wall musculature; gonad; nervous system; and the hypodermis. Human ortholog(s) of this gene are implicated in several diseases, including artery disease (multiple); intrinsic cardiomyopathy (multiple); and myopathy (multiple).
WBGene00000229 WBGene00000229 atp-2 Is an ortholog of human ATP5F1B (ATP synthase F1 subunit beta). Exhibits protein domain specific binding activity. Is predicted to contribute to ATPase activity and proton-transporting ATP synthase activity, rotational mechanism. Is involved in several processes, including determination of adult lifespan; inductive cell migration; and pharyngeal pumping. Localizes to the mitochondrion and non-motile cilium. Is expressed in HOB; head; ray neuron type B; and tail.
WBGene00000370 WBGene00000370 ccg-1 Is enriched in PLM; coelomocyte; germ line; and muscle cell based on tiling array; RNA-seq; and microarray studies. Is affected by several genes including daf-16; daf-2; and glp-1 based on microarray; proteomic; and RNA-seq studies. Is affected by twelve chemicals including aldicarb; Rotenone; and D-glucose based on microarray and RNA-seq studies.
WBGene00000429 WBGene00000429 ceh-2 Is an ortholog of human EMX1 (empty spiracles homeobox 1). Is predicted to have DNA binding activity. Localizes to the nucleus. Is expressed in nervous system; pharynx; somatic nervous system; and vulva.
WBGene00000431 WBGene00000431 ceh-6 Is an ortholog of human POU3F4 (POU class 3 homeobox 4). Is predicted to have DNA-binding transcription factor activity. Is involved in excretion; transdifferentiation; and water homeostasis. Localizes to the cytosol and nucleus. Is expressed in body wall musculature; head neurons; hyp7 syncytium; intestine; and tail neurons. Human ortholog(s) of this gene are implicated in sensorineural hearing loss.

Create interactive heatmap with plotly

In this heatmap we plot the top 10 genes for each of the 37 annotated tissue types for a total of 242 genes.

In [37]:
# first we create these variables to customize the hover text in plotly's heatmap
# the text needs to be arranged in a matrix the same shape as the heatmap
# for the gene descriptions text, which can be several sentences, we add a line break after each sentence
top_expression.var['gene_description_html'] = top_expression.var['gene_description'].str.replace('\. ', '.<br>')
gene_description_text_matrix = np.tile(top_expression.var['gene_description_html'].values, (len(top_expression.obs['cell_type_name']),1) )
gene_ids_text_matrix = np.tile(top_expression.var['gene_id'].values, (len(top_expression.obs['cell_type_name']),1))


# now create the heatmap with plotly
fig = go.Figure(
                data=go.Heatmap(
                        z=top_expression.X.round(3),
                        x=top_expression.var['gene_name'],
                        y=top_expression.obs['cell_type_name'],
                        hoverinfo='text',
                        text=gene_description_text_matrix,
                        customdata=gene_ids_text_matrix,
                        hovertemplate='%{customdata} <br>Name: %{x}<br>Cell type: %{y}<br>Expression: %{z}  <extra>%{text}</extra>',
                        ),
                    layout= {
                        "title": {"text": "Differential expression of Packer C. elegans single cell data"},
                        "height": 800,
                        },
               )

# uncomment line below to save the interactive volcano plot as html
# fig.write_html('worms_interactive_heatmap.html')
fig.show()
In [ ]:
 
In [ ]: